Our data contains real estate data from portions of North Carolina’s Wake County, Chatham County and Durham County. The dataset contains about 270,000 observations and 63 variables containing real estate data - including but not limited to houses, apartments and commercial property. After cleaning our data, the dataset only contains data from Wake County and contains about 64,000 observations and 22 variables.
Our responsibilities are divided as follows:
| Name | Description |
|---|---|
| Calculated.Acreage | Calculated area from property in acres |
| Land.Class | Land classification description |
| Land.Class.Code | Land classification code |
| Total.Structures | Total number of structures on the property |
| Total.Units | Total number of units on the property |
| Building.Value | Revenue Dept. assessed value for structures contained within the property |
| Land.Value | Revenue Dept. assessed value for land contained within the property |
| Land.Sale.Value | US dollar value for the land when last sold. |
| Land.Sale.Date | Date that land was last sold. |
| Total.Sale.Value | US dollar value for the land and building(s) when last sold. |
| Total.Sale.Date | Date that land and/or building(s) when last sold. |
| WC.ETJ | Corporate limits where property is located. |
| Billing.Class | Billing classifications for Revenue Department use. |
| APA.Ownership.Description | American Planning Association (APA) ownership description. |
| APA.Activity.Description | APA activity description. |
| APA.Function.Description | APA function description. |
| APA.Site.Description | APA site description. |
| Total.Building.Square.Footage | Total square footage of the structure. |
| Type.And.Use.Description | Building use and type description |
| Phy.City | City where property is located. |
| Shape.STArea | Property structure area. |
| Year.Built | Year property was built. |
Source: Town of Cary Dataset Schema
My goal for this section is to model what maximizes the total sale
value of a plot of land from this dataset. Initially I wanted to include
both numeric and categorical data in my model but I would run into some
problems – more on that later. Using the cleaned dataset, and all
possible subsets, I picked out seven regressors and compared them and
their criterion to each other: Calculated.Acreage,
Total.Structures, Total.Units,
Building.Value, Land.Value,
Land.Sale.Value, and
Total.Building.Square.Footage. The following shows these
results.
| Acreage | Sructures | Units | Building | Land | Land Sale | Sq. Footage | AdjR2 | Cp | BIC |
|---|---|---|---|---|---|---|---|---|---|
| * | 0.7452296 | 9596.42695 | -93695.7 | ||||||
| * | * | 0.7725733 | 1212.01377 | -101466.9 | |||||
| * | * | * | 0.7740828 | 750.09264 | -101913.2 | ||||
| * | * | * | * | 0.7756057 | 284.09159 | -102366.6 | |||
| * | * | * | * | * | 0.7761798 | 109.05205 | -102532.0 | ||
| * | * | * | * | * | * | 0.7764352 | 31.72609 | -102600.2 | |
| * | * | * | * | * | * | * | 0.7765158 | 8.00000 | -102614.8 |
As shown above, each criterion points to the fact that the full model is the most accurate model. This seems like a great fit to the data – however, it’s not. The model’s variance inflation factor (VIF) and the data’s scatterplot matrix, as shown to the right, proves that there is collinearity between the regressors.
Calculated.Acreage Total.Structures
2.419908 3.413634
Total.Units Building.Value
2.232743 11.698253
Land.Value Land.Sale.Value
9.004143 2.016888
Total.Building.Square.Footage
16.285497
This calls for a reduced model.
Our new best model regresses Total.Sale.Value with
Calculated.Acreage, Total.Structures,
Total.Units, and Land.Sale.Value. The model
summary and VIF is shown below.
Call:
lm(formula = Total.Sale.Value ~ Calculated.Acreage + Total.Structures +
Total.Units + Land.Sale.Value, data = df_mlr)
Residuals:
Min 1Q Median 3Q Max
-65277755 -119735 -20383 85065 95479184
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.405e+06 1.511e+04 -93.01 <2e-16 ***
Calculated.Acreage 3.610e+05 6.232e+03 57.93 <2e-16 ***
Total.Structures 1.591e+06 1.532e+04 103.88 <2e-16 ***
Total.Units 3.762e+04 4.702e+02 79.99 <2e-16 ***
Land.Sale.Value 8.212e-01 1.722e-02 47.69 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1379000 on 68532 degrees of freedom
Multiple R-squared: 0.6066, Adjusted R-squared: 0.6066
F-statistic: 2.642e+04 on 4 and 68532 DF, p-value: < 2.2e-16
Calculated.Acreage Total.Structures Total.Units Land.Sale.Value
1.931941 2.333885 1.820155 1.285553
The calculated acreage, the land sale value, and the total number of structures and units have a direct correlation against the total sale value of a property. This model explains about 60.66% of the variance, as seen by the adjusted-\(R^2\) value.
However, there’s a problem: none of the assumptions for multiple
linear regression are satisfied, and a log transform is necessary. The
only problem that comes with this is that Total.Units
cannot be used anymore since a large portion of the data contains
zero.
The final model is as follows:
Call:
lm(formula = log(Total.Sale.Value) ~ log(Calculated.Acreage) +
log(Total.Structures) + log(Land.Sale.Value), data = df_mlr)
Residuals:
Min 1Q Median 3Q Max
-9.0474 -0.2003 -0.0033 0.2204 4.4988
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.859083 0.030204 227.09 <2e-16 ***
log(Calculated.Acreage) 0.101919 0.002066 49.33 <2e-16 ***
log(Total.Structures) 0.713479 0.018643 38.27 <2e-16 ***
log(Land.Sale.Value) 0.554229 0.002730 203.01 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4743 on 68533 degrees of freedom
Multiple R-squared: 0.4493, Adjusted R-squared: 0.4493
F-statistic: 1.864e+04 on 3 and 68533 DF, p-value: < 2.2e-16
Looking at the graphs to the right (aside from collinearity), you can see a side-by-side comparison showing how the log transform has satisfied all assumptions but normality. With this in mind, this model might not produce the most accurate or precise results. Nonparametric methods should be considered when asking this question.
As expected, the value of the land is directly correlated to to the Total Sale Value. The relationship does not appear to be exactly linear, but has a logarithmic curve.
7 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -3.366018e+05
Calculated.Acreage -1.433054e+04
Total.Structures 3.624719e+05
Total.Units 9.789264e+03
Building.Value 6.350616e-01
Land.Value 1.698934e+00
Land.Sale.Value 3.875537e-03
[1] 0.7719416
Let’s pretend that you’re a property developer building single-family houses and you bought a piece of land and want to estimate how much money your residents are going to pay per month in rent. Over the span of three years, about how much will they need to pay per month for your net income to break even?
I wanted to view the relationship between the land value, acreage and total sale value of a plot of residential land. Since Naive Bayes is much better suited for categorical data than numeric data, I created intervals for these variables. I would calculate the estimated rent as follows:
\[\hbox{Rent} = \frac{1}{36} \, \left(\hbox{Land Value - Total Sale Value}\right)\]
Below is a table summarizing these intervals:
| Interval | Description |
|---|---|
| Calculated | “Small” if smaller than 0.4 acres, “Large” if greater |
| Land Value | Less than $50k, between $50k and $100k or over $100k |
| Total Sale Value | Less than $200k, between $200k and $400k or over $400k |
| Estimated Rent | Less than $5k, between $5k and $10k or over $10k |
With this in mind, the output of the confusion matrix is shown below:
Confusion Matrix and Statistics
Actual
Predicted <5k >10k 5k-10k
<5k 2882 9 14
>10k 23 5840 1936
5k-10k 2427 79 7051
Overall Statistics
Accuracy : 0.7785
95% CI : (0.7727, 0.7842)
No Information Rate : 0.4443
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.6539
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: <5k Class: >10k Class: 5k-10k
Sensitivity 0.5405 0.9852 0.7834
Specificity 0.9985 0.8633 0.7774
Pos Pred Value 0.9921 0.7488 0.7378
Neg Pred Value 0.8588 0.9929 0.8178
Prevalence 0.2632 0.2926 0.4443
Detection Rate 0.1422 0.2882 0.3480
Detection Prevalence 0.1434 0.3849 0.4717
Balanced Accuracy 0.7695 0.9242 0.7804
What makes a piece of property expensive? I wanted to classify
---
title: "Property Sale Value Analysis"
author: "Marcos Fassio Bazzi, Ariq Rashid"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
source_code: embed
theme: yeti
---
```{r setup, include=FALSE}
library(flexdashboard)
library(knitr)
library(olsrr)
library(leaps)
library(faraway)
library(GGally)
library(tidyverse)
library(glmnet)
library(ISLR2)
library(splines)
library(caret)
library(FNN)
library(e1071)
library(gmodels)
library(car)
library(arm)
library(boot)
```
```{r}
##CHANGE THIS WORKING DIRECTORY--------------------
setwd("C:/Users/azras/OneDrive/Documents/School/VT/Fall 2023/Intermediate Data and ML - CMDA 4654/Project 1")
```
```{r}
df <- read.csv("data/property.csv", header=TRUE)
wake_cols <- c(10, 18:21, 24:29, 34, 43:46, 48, 50, 52, 53, 59, 61) # columns i want in cleaned data
useless_APA_cols <- c("Site In Natural State", "Developing Site") # rows i DONT want in cleaned data
useless_land_class_cols <- c("Part Exempt", "State Assessed", "Vacant Land", "Manufactured Home",
"Manufactured Home Park")
useless_ownership_cols <- c("County, Parish, Province, Etc.", "Federal Government")
df_w <- df %>%
dplyr::select(all_of(wake_cols)) %>%
mutate_if(is.character, list(~na_if(.,""))) %>%
filter(Land.Value != 0 & Building.Value != 0 & Land.Sale.Value > 0 & Total.Sale.Value > 0) %>%
filter(!APA.Site.Description %in% useless_APA_cols) %>%
filter(!Land.Class %in% useless_land_class_cols) %>%
filter(!APA.Ownership.Description %in% useless_ownership_cols) %>%
# only one instance found - essentially an exception
drop_na()
```
Overview
=======================================================================
data summary {data-width=350}
-----------------------------------------------------------------------
### About our Data
Our data contains real estate data from portions of North Carolina's Wake County, Chatham County and Durham County. The dataset contains about 270,000 observations and 63 variables containing real estate data - including but not limited to houses, apartments and commercial property. After cleaning our data, the dataset only contains data from Wake County and contains about 64,000 observations and 22 variables.
* Source: [data.gov](https://catalog.data.gov/dataset/real-estate-data)
* Landing page: [townofcary.org](https://data.townofcary.org/explore/dataset/property/information)
* Raw Data: [Google Drive](https://drive.google.com/file/d/15Gm2mkMeiIoca7K-YL8wG-0lwqsXrdJG/view?usp=sharing).
Our responsibilities are divided as follows:
* Ariq:
- Natural Cubic Splines
- k-Nearest Neighbors Classification
- Ridge Regression
* Marcos:
- Multiple Linear Regression
- Naive Bayes Classification
- Logistic Regression
Table Summary {data-width=650}
-----------------------------------------------------------------------
### Table Summary
| **Name** | **Description** |
|:-----------------------------:|:-------------------------------------------------------------------------:|
| Calculated.Acreage | Calculated area from property in acres |
| Land.Class | Land classification description |
| Land.Class.Code | Land classification code |
| Total.Structures | Total number of structures on the property |
| Total.Units | Total number of units on the property |
| Building.Value | Revenue Dept. assessed value for structures contained within the property |
| Land.Value | Revenue Dept. assessed value for land contained within the property |
| Land.Sale.Value | US dollar value for the land when last sold. |
| Land.Sale.Date | Date that land was last sold. |
| Total.Sale.Value | US dollar value for the land and building(s) when last sold. |
| Total.Sale.Date | Date that land and/or building(s) when last sold. |
| WC.ETJ | Corporate limits where property is located. |
| Billing.Class | Billing classifications for Revenue Department use. |
| APA.Ownership.Description | American Planning Association (APA) ownership description. |
| APA.Activity.Description | APA activity description. |
| APA.Function.Description | APA function description. |
| APA.Site.Description | APA site description. |
| Total.Building.Square.Footage | Total square footage of the structure. |
| Type.And.Use.Description | Building use and type description |
| Phy.City | City where property is located. |
| Shape.STArea | Property structure area. |
| Year.Built | Year property was built. |
**Source: Town of Cary Dataset Schema**
Multiple Linear Regression
=======================================================================
summary {.sidebar}
-----------------------------------------------------------------------
### Summary
What maximizes the total sale value of a piece of property? After comparing all possible subsets and their criterion and accounting for collinearity, the model regressing the total sale value to the calculated acreage, total structures, total units and land sale value is our most accurate model.
However, the model violated all assumptions. Applying a log transform satisfies every assumption but normal distribution. You should note that this model might not produce the most accurate or precise results and should consider nonparametric methods and models.
Side-by-side comparisons are provided in the "Variance", "Normality" and "Leverage" tabs.
Column {data-width=500, .tabset}
-----------------------------------------------------------------------
### Research & Model Selection
```{r}
mlr_cols <- c(1, 4:8, 10, 18)
df_mlr <- df_w %>%
dplyr::select(all_of(mlr_cols))
```
My goal for this section is to model what maximizes the total sale value of a plot of land from this dataset. Initially I wanted to include both numeric and categorical data in my model but I would run into some problems -- more on that later. Using the cleaned dataset, and all possible subsets, I picked out seven regressors and compared them and their criterion to each other: `Calculated.Acreage`, `Total.Structures`, `Total.Units`, `Building.Value`, `Land.Value`, `Land.Sale.Value`, and `Total.Building.Square.Footage`. The following shows these results.
```{r}
best_subsets <- regsubsets(Total.Sale.Value ~ ., data=df_mlr)
best_subsets_results <- summary(best_subsets)
attach(best_subsets_results)
table <- data.frame(outmat, adjr2, cp, bic)
colnames(table) <- c("Acreage", "Sructures", "Units", "Building",
"Land", "Land Sale", "Sq. Footage", "AdjR2", "Cp", "BIC")
rownames(table) <- c(1, 2, 3, 4, 5, 6, 7)
knitr::kable(table)
detach(best_subsets_results)
```
As shown above, each criterion points to the fact that the full model is the most accurate model. This *seems* like a great fit to the data -- however, it's not. The model's variance inflation factor (VIF) and the data's scatterplot matrix, as shown to the right, proves that there is collinearity between the regressors.
```{r}
full_model <- lm(Total.Sale.Value ~ ., data=df_mlr)
vif(full_model)
```
This calls for a reduced model.
### Reduced Model & Diagnostics
Our new best model regresses `Total.Sale.Value` with `Calculated.Acreage`, `Total.Structures`, `Total.Units`, and `Land.Sale.Value`. The model summary and VIF is shown below.
```{r}
reduced_model <- lm(Total.Sale.Value ~ Calculated.Acreage + Total.Structures + Total.Units + Land.Sale.Value,
data=df_mlr)
summary(reduced_model)
vif(reduced_model)
```
The calculated acreage, the land sale value, and the total number of structures and units have a direct correlation against the total sale value of a property. This model explains about 60.66% of the variance, as seen by the adjusted-$R^2$ value.
However, there's a problem: none of the assumptions for multiple linear regression are satisfied, and a log transform is necessary. The only problem that comes with this is that `Total.Units` cannot be used anymore since a large portion of the data contains zero.
### Final Model
The final model is as follows:
```{r}
final_model <- lm(log(Total.Sale.Value) ~ log(Calculated.Acreage) + log(Total.Structures) +
log(Land.Sale.Value), data=df_mlr)
summary(final_model)
```
Looking at the graphs to the right (aside from collinearity), you can see a side-by-side comparison showing how the log transform has satisfied all assumptions but normality. With this in mind, this model might not produce the most accurate or precise results. Nonparametric methods should be considered when asking this question.
Column {data-width=500 .tabset}
-----------------------------------------------------------------------
### Collinearity
```{r}
temp_df <- df_mlr
colnames(temp_df) <- c("CA", "TS", "TU", "BV", "LV", "LSV", "TSV", "TBSF")
ggpairs(temp_df[1:10000, ]) + # i will NOT let R run 70000 data points 49 times.
theme(axis.line=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
```
### Variance
```{r}
par(mfrow=c(2,1))
plot(reduced_model, which=1, pch=20, main="Full Model")
plot(final_model, which=1, pch=20, main="Log Model")
```
### Normality
```{r}
par(mfrow=c(2,1))
plot(reduced_model, which=2, pch=20, main="Full Model")
plot(final_model, which=2, pch=20, main="Log Model")
```
### Leverage
```{r}
par(mfrow=c(2,1))
plot(reduced_model, which=5, pch=20, main="Full Model")
plot(final_model, which=5, pch=20, main="Log Model")
```
Natural Cubic Splines
=======================================================================
### How Land Value affects Total Sale Value
```{r, fig.width=10, fig.height=7}
#Creating dataframe for plot
ns_filtered <- df_w %>%
filter(Land.Value < 250000, Total.Sale.Value < 900000) %>%
dplyr::select(Land.Value, Total.Sale.Value)
ns_df <- ns_filtered %>% sample_n(3000)
#Creating NS with different dfs
ns.1 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 1), data = ns_df)
ns.2 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 2), data = ns_df)
ns.3 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 3), data = ns_df)
ns.4 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 4), data = ns_df)
ns.5 <- lm(Total.Sale.Value ~ ns(Land.Value, df= 5), data = ns_df)
#Determining which df is best
SSE.1 <- sum(residuals(ns.1)^2)
SSE.2 <- sum(residuals(ns.2)^2)
SSE.3 <- sum(residuals(ns.3)^2)
SSE.4 <- sum(residuals(ns.4)^2)
SSE.5 <- sum(residuals(ns.5)^2)
SSE.1 <- c(SSE.1, SSE.2, SSE.3, SSE.4, SSE.5)
#From plotting separately, degree 3 seems best
#Plot
ggplot(data = ns_df, aes(x = Land.Value, y = Total.Sale.Value)) +
geom_point(size = 0.9, color = "royalblue") +
geom_smooth(method = "lm", formula = y ~ ns(x, df = 3), se = TRUE,
color = "brown4", size=2) +
geom_vline(xintercept = attributes(ns(ns_df$Land.Value, df = 3))$knots,
linetype = "dashed", color = "black", size = 1) +
ggtitle("Natural Cubic Spline on Land Value vs Total Sale Value") +
labs(x = "Land Value ($)", y = "Total Sale Value ($)") +
theme_minimal() +
theme(plot.title = element_text(size = 18, hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
legend.position = "none")
```
As expected, the value of the land is directly correlated to to the Total Sale Value. The relationship does not appear to be exactly linear, but has a logarithmic curve.
Ridge Regression
=======================================================================
Summary {.sidebar}
-----------------------------------------------------------------------
### Summary
This is the Summary for Ridge Regression
Column {data-width=500, .tabset}
-----------------------------------------------------------------------
### Building First Model
```{r}
y <- df_w$Total.Sale.Value
x <- data.matrix(df_w[, c('Calculated.Acreage', 'Total.Structures', 'Total.Units', 'Building.Value','Land.Value','Land.Sale.Value')])
ridge_model1 <- glmnet(x, y, alpha = 0)
cv_model1 <- cv.glmnet(x, y, alpha = 0)
best_lambda1 <- cv_model1$lambda.min
plot(cv_model1)
```
### Coeffecients
```{r}
best_ridge_model <- glmnet(x, y, alpha = 0, lambda = best_lambda1)
coef(best_ridge_model)
plot(ridge_model1, xvar = "lambda")
```
### R-Squared
```{r}
#use fitted best model to make predictions
y_predicted <- predict(ridge_model1, s = best_lambda1, newx = x)
#find SST and SSE
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)
#find R-Squared
rsq <- 1 - sse/sst
rsq
```
Naive Bayes
=======================================================================
```{r}
nb_cols <- c(1, 2, 4:8, 10, 15)
df_nb <- df_w %>%
dplyr::select(all_of(nb_cols)) %>%
filter(APA.Activity.Description %in% "Household Activities" &
Land.Class %in% "Residential < 10 Acres" & Total.Structures == 1) %>%
mutate(Calculated.Acerage.Class = case_when(Calculated.Acreage > 0.4 ~ "Large",
Calculated.Acreage <= 0.4 ~ "Small")) %>%
mutate(Land.Value.Class = case_when(Land.Value < 50000 ~ "<50k",
Land.Value >= 50000 & Land.Value < 100000 ~ "50k-100k",
Land.Value >= 100000 ~ ">100k")) %>%
mutate(Total.Sale.Class = case_when(Total.Sale.Value < 200000 ~ "<200k",
Total.Sale.Value >= 200000 & Total.Sale.Value < 400000 ~ "200k-400k",
Total.Sale.Value >= 400000 ~ ">400k")) %>%
mutate(Estimated.Rent = round(abs(Land.Value - Total.Sale.Value) / 36), .before=10) %>%
mutate(Estimated.Rent.Class = case_when(Estimated.Rent < 5000 ~ "<5k",
Estimated.Rent >= 5000 & Estimated.Rent < 10000 ~ "5k-10k",
Estimated.Rent >= 10000 ~ ">10k"))
n <- createDataPartition(y=df_nb$Estimated.Rent.Class, p=0.7, list=FALSE)
nb_train <- df_nb[n, ]
nb_test <- df_nb[-n, ]
model <- naiveBayes(Estimated.Rent.Class ~ Calculated.Acerage.Class + Land.Value.Class + Total.Sale.Class,
data=nb_train, laplace=0.5)
yhat <- predict(model, newdata=nb_test)
confusion_matrix <- confusionMatrix(factor(yhat), factor(nb_test$Estimated.Rent.Class),
dnn=c("Predicted", "Actual"))
```
Summary {.sidebar}
-----------------------------------------------------------------------
### Summary
I built a Naive Bayes classifier to classify the amount of money a family will pay per month in rent based on acreage, land value and total sale value. Since Naive Bayes is a categorical classifier by nature, I created intervals for each numerical variable and classified my data using this model.
My most accurate model classified the calculated acreage, land value, and total sale value of the property with an accuracy rate of 78%.
Column
-----------------------------------------------------------------------
### Accuracy rate
```{r}
valueBox("77%", icon="ion-android-checkmark-circle")
```
### Observations tested
```{r}
valueBox(20261, icon="ion-wrench")
```
### Training/testing split
```{r}
valueBox("70-30%", icon="ion-android-clipboard")
```
### Confusion Matrix
```{r}
temp <- as.data.frame(confusion_matrix$table)
colnames(temp)[3] <- "Frequency"
temp$Predicted <- factor(temp$Predicted, levels=rev(levels(temp$Predicted)))
ggplot(temp, aes(Predicted, Actual, fill=Frequency)) + theme_bw() +
geom_tile() + geom_text(aes(label=Frequency)) +
scale_fill_gradient(low="white", high="#43c3f4") +
labs(title="Predicted vs. Actual Estimated Rent per Month")
```
Column
-----------------------------------------------------------------------
### Research Question & Classification Process
Let's pretend that you're a property developer building single-family houses and you bought a piece of land and want to estimate how much money your residents are going to pay per month in rent. Over the span of three years, about how much will they need to pay per month for your net income to break even?
I wanted to view the relationship between the land value, acreage and total sale value of a plot of residential land. Since Naive Bayes is much better suited for categorical data than numeric data, I created intervals for these variables. I would calculate the estimated rent as follows:
$$\hbox{Rent} = \frac{1}{36} \, \left(\hbox{Land Value - Total Sale Value}\right)$$
Below is a table summarizing these intervals:
| **Interval** | **Description** |
|:----------------:|:------------------------------------------------------:|
| Calculated | "Small" if smaller than 0.4 acres, "Large" if greater |
| Land Value | Less than $50k, between $50k and $100k or over $100k |
| Total Sale Value | Less than $200k, between $200k and $400k or over $400k |
| Estimated Rent | Less than $5k, between $5k and $10k or over $10k |
With this in mind, the output of the confusion matrix is shown below:
```{r}
confusion_matrix
```
kNN Classification
=======================================================================
summary {.sidebar}
-----------------------------------------------------------------------
### Summary
Now that we confirmed that Land Value does a large, positive impact on the Total Sale Value, I wanted to see if some other factors could contribute to the Total Sale as well. A column that stood out to me was "Year Built". I assumed this would have some direct impact as well, since properties and estates gradually become more valuable over time. I wanted to see if based off the Sale and Land value, I could predict what time period the property was built. Since there is a wide range of years, I created a new category and binned year ranges together. Before 1995 would be "Old", between 1995 and 2018 would be "Modern", and anything past 2019 would be "New".
Column
---------------------------------------------------------------------------
### Predicted Time Periods on Reduced Data
```{r}
years_binned <- df_w %>%
filter(Land.Value < 250000, Total.Sale.Value < 900000) %>%
dplyr::select(Land.Value, Total.Sale.Value, Year.Built)
years_binned$Time.Period <- ifelse(years_binned$Year < 1995, "Old(Before 1995)",
ifelse(years_binned$Year >= 1995 & years_binned$Year <= 2019,
"Modern(1995-2019)","New(2019-)"))
desired_columns <- c('Time.Period','Total.Sale.Value','Land.Value')
columns1_df <- years_binned[, desired_columns]
set.seed((42))
knn_df <- years_binned %>% sample_n(5000)
index <- sample(1:nrow(knn_df), round(nrow(knn_df) * 0.7))
training_df <- knn_df[index, ]
testing_df <- knn_df[-index, ]
# Store the training/testing data features
train_features1 <- training_df[, 1:2]
test_features1 <- testing_df[, 1:2]
# Scale the features
train_features <- scale(train_features1)
test_features <- scale(test_features1)
# Store the actual labels (assuming Total.Sale.Value is a factor)
train_classes <- factor(training_df$Time.Period)
test_classes <- factor(testing_df$Time.Period)
knn_classes <- knn(train = train_features, test = test_features,
cl = train_classes, k = 5)
confusion_matrix <- confusionMatrix(data = knn_classes, reference = test_classes)
plot_data <- data.frame(Feature1 = test_features1$Land.Value,
Feature2 = test_features1$Total.Sale.Value,
PredictedClass = knn_classes)
# Create a scatter plot
ggplot(plot_data, aes(x = Feature1, y = Feature2, color = as.factor(PredictedClass))) +
geom_point() +
labs(x = "Land Value ($)", y = "Total Sale Value ($)",
color = "Predicted Time Periods", title = "kNN Classification of Time Period with 76% Accuracy") +
theme_classic()
```
Column
-----------------------------------------------------------------------
### Accuracy rate (%)
```{r}
valueBox(76, icon="ion-android-checkmark-circle")
```
### Observations tested
```{r}
valueBox(5000, icon="ion-wrench")
```
### Training/testing split (%)
```{r}
valueBox("70-30", icon="ion-android-clipboard")
```
Logistic Regression
=======================================================================
Column {.sidebar}
-----------------------------------------------------------------------
### Summary
What makes a piece of property expensive? I used the median property value in Wake County, NC (source: [datausa.io](https://datausa.io/profile/geo/wake-county-nc#:~:text=Median%20household%20income%20in%20Wake,values%20of%20%24211%2C213%20and%20%24194%2C737) as my cutoff point between a property being expensive or not and fit a logistic regression model using all possible subsets. After accounting for multicollinearity, my final model uses the calculated acreage, land value, total structures and units as the predictors. After predicting the response and link values and fitting my logit function onto the graph, I concluded my model has a 14% misclassification rate after using the `cv.glm` function.
Column {data-width=500}
-----------------------------------------------------------------------
### Model Selection
What makes a piece of property expensive? I wanted to classify